Using Real Data to Compare DIF Detection and Effect Size Measures among Mantel-Haenszel, SIBTEST, and Logistic Regression Procedures
Abstract
To date, many studies have been conducted to compare the performance of different DIF procedures using simulated data sets. However, some results from these simulation studies are inconsistent with one another (e.g., Hidalgo & López-Pina, 2004; Jodoin & Gierl, 2001). This study used real data to systematically investigate the consistency of DIF detection and effect size measures among three widely used DIF procedures: Mantel-Haenszel (MH), the Simultaneous Item Bias Test (SIBTEST), and logistic regression (LR). Several indicators, including correlations among the procedures' effect size measures, matching percentages, and relative matching percentages, were used to evaluate the consistency among the DIF procedures. The results showed high correlations among DIF effect size measures, moderate to high matching percentages among DIF classifications, and a broad range of relative matching percentages among DIF procedures.

Using Real Data to Compare DIF Detection and Effect Size Measures among Mantel-Haenszel, SIBTEST, and Logistic Regression Procedures

Differential item functioning (DIF) is of great interest to researchers and educators given that DIF poses a potential threat to test fairness. A variety of DIF detection procedures and effect size measures have been proposed to quantify the magnitude of DIF, such as the IRT methods (Lord, 1980; Thissen, Steinberg, & Wainer, 1993), the Mantel-Haenszel statistic (MH; Holland & Thayer, 1988), the standardization procedure (Dorans & Kulick, 1986), the Simultaneous Item Bias Test (SIBTEST; Shealy & Stout, 1993a), and logistic regression (LR; Swaminathan & Rogers, 1990). Often, multiple procedures are used simultaneously to help detect items showing DIF (Hambleton & Jones, 1994). As a result, high consistency, or matching percentages, among DIF procedures is of consequence in real testing situations.
To date, many studies have been conducted to compare the performance of different DIF procedures using simulated data sets (e.g., Fidalgo, Ferreres, & Muñiz, 2004; Gierl, Jodoin, & Ackerman, 2000; Hidalgo & López-Pina, 2004; Jodoin & Gierl, 2001; Narayanan & Swaminathan, 1994; Roussos & Stout, 1996). However, some results from these simulation studies are inconsistent with one another. For instance, Jodoin and Gierl (2001) reported that 68.2% of DIF items were identified as containing at least moderate DIF using the logistic regression procedure in their simulation study. In contrast, Hidalgo and López-Pina (2004) found that, using the same procedure and the same classification guideline, only 15.3% of DIF items were classified as having moderate DIF. Therefore, the purpose of this study is to use real data to systematically investigate the consistency of DIF detection and effect size measures among three widely used DIF procedures: the MH procedure, the SIBTEST procedure, and the LR procedure.

This paper is divided into four sections. First, a brief overview of the three DIF procedures is provided. Second, the methods used in this study for evaluating the consistency among the three DIF procedures are described. Third, the results are presented. Fourth, the implications and future directions are discussed.

Overview of the three DIF procedures

Mantel-Haenszel

Mantel-Haenszel (MH) is one of the most widely used approaches for identifying DIF based on the analysis of contingency tables (Clauser & Mazor, 1998; Holland & Thayer, 1988). The MH procedure yields a chi-square test with one degree of freedom of the null hypothesis that, after controlling for ability, there is no relation between group membership and performance on an item.
The MH statistic is computed by matching examinees in the two groups on their total test scores and then forming a 2-by-2-by-K contingency table for each item, where K is the total number of score levels on the matching variable, namely, the total test score. At each score level j, a 2-by-2 contingency table is created for each item i, as shown in Figure 1. The MH chi-square test is calculated as follows:
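For reference, the MH chi-square statistic (Holland & Thayer, 1988) takes the standard form MH-χ² = (|Σ_j A_j − Σ_j E(A_j)| − 0.5)² / Σ_j Var(A_j), where A_j is the reference-group correct count at score level j, E(A_j) = n_Rj m_1j / T_j, and Var(A_j) = n_Rj n_Fj m_1j m_0j / (T_j²(T_j − 1)), with n_Rj and n_Fj the reference- and focal-group sizes, m_1j and m_0j the correct and incorrect counts, and T_j the stratum total. A minimal sketch of this computation, together with the MH common odds ratio α_MH and the associated ETS delta effect size Δ_MH = −2.35 ln(α_MH); the function name, table layout, and the example counts are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def mantel_haenszel(tables):
    """MH chi-square (with continuity correction), common odds ratio
    alpha_MH, and ETS delta_MH for one item.

    tables: sequence of K 2x2 tables, one per matched score level,
    laid out as [[A_j, B_j], [C_j, D_j]] with rows = (reference, focal)
    group and columns = (correct, incorrect).
    """
    t = np.asarray(tables, dtype=float)
    A, B = t[:, 0, 0], t[:, 0, 1]      # reference group: correct, incorrect
    C, D = t[:, 1, 0], t[:, 1, 1]      # focal group: correct, incorrect
    T = A + B + C + D                  # stratum totals T_j
    E = (A + B) * (A + C) / T          # E(A_j) = n_Rj * m_1j / T_j
    V = (A + B) * (C + D) * (A + C) * (B + D) / (T**2 * (T - 1))
    chi2 = (abs(A.sum() - E.sum()) - 0.5) ** 2 / V.sum()  # 1-df chi-square
    alpha = (A * D / T).sum() / (B * C / T).sum()         # common odds ratio
    delta = -2.35 * np.log(alpha)                         # ETS delta metric
    return chi2, alpha, delta

# Hypothetical item: two score strata, each with 40 reference and 40 focal
# examinees; the reference group answers correctly more often in both strata.
tables = [[[30, 10], [20, 20]],
          [[25, 15], [15, 25]]]
chi2, alpha, delta = mantel_haenszel(tables)
# chi2 ≈ 9.20 (significant against the 1-df critical value 3.84),
# alpha ≈ 2.88, delta ≈ -2.49.
```

Under the usual ETS classification rules, |Δ_MH| below 1 is treated as negligible (category A) and |Δ_MH| above 1.5 as large (category C) DIF, so the hypothetical item above would be flagged as showing large DIF against the focal group.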